Introduction
In this notebook, I will combine the materials from all 5 sessions of class to create a complete analysis of a data set. I will use the “sleep duration” data set*, which relates teenagers’ sleep duration to having a TV in the bedroom and to smartphone use before bed.
*This data set was received in PUBH 6862.
Read in libraries
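A minimal sketch of this step. The package list is an assumption: ggplot2, ggthemes and plotly are called by namespace in the plotting code further down, dplyr is named in the normality section, and psych is inferred from the layout of the descriptive tables rather than stated anywhere in the notebook.

```r
# Assumed package list (see the note above); adjust to the packages actually used.
pkgs <- c("dplyr", "ggplot2", "ggthemes", "plotly", "psych")

# Check which packages are installed before attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach only the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```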
Read in data
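A minimal sketch of reading the data, under stated assumptions: the file name sleep_duration.csv is hypothetical, because the notebook does not name its source file, and base read.csv stands in for whatever reader produced the tibble shown below. To keep the sketch self-contained, it first writes the six rows printed in the next section to a temporary file; the full data set has 50 rows.

```r
# First six rows as printed by head() in this notebook (the full data set has 50 rows).
csv_text <- "sleep,tv,age,smartphone
7.8,1,18,0
8.6,0,18,1
8.7,1,18,0
7.2,1,13,1
6.3,1,18,1
7.6,0,18,1"

# Hypothetical file location: the notebook does not name the file it reads.
path <- file.path(tempdir(), "sleep_duration.csv")
writeLines(csv_text, path)

# Read the file into a data frame and inspect its structure.
sleep <- read.csv(path)
str(sleep)
head(sleep)
```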
Data Dictionary
We have \(4\) variables, one in each column of the data set.
The properties of the variables are tabled below.
| Variable | Explanation | Units / coding |
|---|---|---|
| \(\texttt{sleep}\) | Hours of sleep per night | hours |
| \(\texttt{age}\) | Age of participant | years |
| \(\texttt{tv}\) | TV in bedroom | 0 = no, 1 = yes |
| \(\texttt{smartphone}\) | Smartphone use before bed | 0 = no, 1 = yes |
Explore data
# A tibble: 6 × 4
sleep tv age smartphone
<dbl> <dbl> <dbl> <dbl>
1 7.8 1 18 0
2 8.6 0 18 1
3 8.7 1 18 0
4 7.2 1 13 1
5 6.3 1 18 1
6 7.6 0 18 1
sleep tv age smartphone
Min. : 5.200 Min. :0.00 Min. :11.00 Min. :0.00
1st Qu.: 7.100 1st Qu.:0.00 1st Qu.:14.00 1st Qu.:0.00
Median : 7.800 Median :0.00 Median :15.50 Median :0.00
Mean : 7.884 Mean :0.42 Mean :15.50 Mean :0.36
3rd Qu.: 8.700 3rd Qu.:1.00 3rd Qu.:17.75 3rd Qu.:1.00
Max. :10.800 Max. :1.00 Max. :19.00 Max. :1.00
Summary Statistics
vars n mean sd median trimmed mad min max range skew
sleep 1 50 7.88 1.27 7.8 7.88 1.26 5.2 10.8 5.6 0.14
tv 2 50 0.42 0.50 0.0 0.40 0.00 0.0 1.0 1.0 0.31
age 3 50 15.50 2.11 15.5 15.55 2.22 11.0 19.0 8.0 -0.07
smartphone 4 50 0.36 0.48 0.0 0.32 0.00 0.0 1.0 1.0 0.57
kurtosis se
sleep -0.35 0.18
tv -1.94 0.07
age -1.22 0.30
smartphone -1.71 0.07
Research questions
Primary research question
Are \(\texttt{tv}\) and \(\texttt{smartphone}\) associated with \(\texttt{sleep}\)?
Describe
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 50 7.88 1.27 7.8 7.88 1.26 5.2 10.8 5.6 0.14 -0.35 0.18
The average number of hours of sleep per night is 7.88, with a standard deviation of 1.27. The median number of hours of sleep per night is 7.8. The minimum number of hours of sleep per night is 5.2 and the maximum is 10.8.
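The se column in the table above is the standard error of the mean, the standard deviation divided by the square root of the sample size. A short check using the rounded values from the table:

```r
# Values taken from the descriptive table for sleep (sd is rounded to 2 decimals).
n <- 50
s <- 1.27

# Standard error of the mean: sd / sqrt(n).
se <- s / sqrt(n)
round(se, 2)  # 0.18, matching the se column
```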
Update labels of categorical variables
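A minimal sketch of the relabelling, assuming the 0/1 codes from the data dictionary are converted to factors with levels "No" and "Yes" (the level names that appear in the grouped output and in the plot colours). Base R factor is used here; the notebook's own code may differ. Only the six rows printed earlier are used.

```r
# First six rows of the data set, as printed by head() earlier.
sleep <- data.frame(
  sleep      = c(7.8, 8.6, 8.7, 7.2, 6.3, 7.6),
  tv         = c(1, 0, 1, 1, 1, 0),
  age        = c(18, 18, 18, 13, 18, 18),
  smartphone = c(0, 1, 0, 1, 1, 1)
)

# Recode 0/1 as labelled factors so that groups print as "No" / "Yes".
sleep$tv <- factor(sleep$tv, levels = c(0, 1), labels = c("No", "Yes"))
sleep$smartphone <- factor(sleep$smartphone, levels = c(0, 1), labels = c("No", "Yes"))

# Counts per level for each relabelled variable.
table(sleep$tv)
table(sleep$smartphone)
```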
Descriptive statistics for sleep by group: first by TV in bedroom, then by smartphone use before bed
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 29 8.32 1.16 8.5 8.26 1.19 6.4 10.8 4.4 0.36 -0.68 0.22
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 21 7.28 1.18 7.2 7.28 1.19 5.2 9.5 4.3 0.02 -0.92 0.26
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 32 8.07 1.25 8.2 8.1 1.33 5.2 10.8 5.6 -0.2 -0.48 0.22
------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 18 7.54 1.26 7.25 7.47 0.89 5.4 10.8 5.4 0.74 0.54 0.3
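The grouped tables above can be approximated in base R with tapply. This is a sketch on the six rows printed earlier, so its numbers will not match the full 50-row results; the function that produced the tables above is not shown in the notebook.

```r
# First six rows of the data set, with tv already labelled No / Yes.
sleep <- data.frame(
  sleep = c(7.8, 8.6, 8.7, 7.2, 6.3, 7.6),
  tv    = factor(c("Yes", "No", "Yes", "Yes", "Yes", "No"), levels = c("No", "Yes"))
)

# Group size, mean and standard deviation of sleep for each level of tv.
group_n    <- tapply(sleep$sleep, sleep$tv, length)
group_mean <- tapply(sleep$sleep, sleep$tv, mean)
group_sd   <- tapply(sleep$sleep, sleep$tv, sd)

# Collect the results in one small table, one row per group.
data.frame(n = group_n, mean = group_mean, sd = group_sd)
```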
Data Visualization
# Boxplot of sleep duration by TV in bedroom
sleep_tv <- sleep %>%
  ggplot2::ggplot(ggplot2::aes(x = tv, y = sleep)) +
  ggplot2::geom_boxplot(ggplot2::aes(fill = tv), show.legend = FALSE) +
  ggplot2::labs(
    title = "Distribution of Sleep Hours",
    subtitle = "Comparison between TV and no TV in room"
  ) +
  ggplot2::scale_fill_manual(values = c("Yes" = "#17becf", "No" = "#e377c2")) +
  ggplot2::xlab("TV in Room") +
  ggplot2::ylab("Sleep Duration") +
  ggthemes::theme_clean()

# Render as an interactive plot
plotly::ggplotly(sleep_tv)

# Boxplot of sleep duration by smartphone use before bed
sleep_phone <- sleep %>%
  ggplot2::ggplot(ggplot2::aes(x = smartphone, y = sleep)) +
  ggplot2::geom_boxplot(ggplot2::aes(fill = smartphone), show.legend = FALSE) +
  ggplot2::labs(
    title = "Distribution of Sleep Hours",
    subtitle = "Comparison between Phone Use Before Bed and No Phone Use Before Bed"
  ) +
  ggplot2::scale_fill_manual(values = c("Yes" = "#17becf", "No" = "#e377c2")) +
  ggplot2::xlab("Phone Use Before Bed") +
  ggplot2::ylab("Sleep Duration") +
  ggthemes::theme_clean()

# Render as an interactive plot
plotly::ggplotly(sleep_phone)

Inferential Statistics
A continuous variable can be compared between two groups using a t test if the assumptions for the use of parametric tests are met.
The assumptions are (1) normality and (2) equality of variances.
Normality
Here, we will use the Shapiro-Wilk test to determine if the continuous variable is from a population in which the values are normally distributed. Under the null hypothesis for this test, the variable can be described by a normal distribution. We will set a level of significance of \(\alpha=0.05\) throughout.
Below, we use the filter and select verbs from the dplyr library and the pull function, which returns a vector of values. We use a pipeline to pipe the vector to the shapiro.test function. We start with the sleep duration of those without a TV in their room.
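The pipeline described above can be sketched as follows. The data here are simulated stand-ins (drawn with the group sizes, means and standard deviations reported earlier), not the real data set, so the test statistic will differ from the output that follows.

```r
library(dplyr)

set.seed(1)

# Simulated stand-in for the data set: 29 teenagers without a TV, 21 with one.
sleep <- data.frame(
  sleep = c(rnorm(29, mean = 8.32, sd = 1.16), rnorm(21, mean = 7.28, sd = 1.18)),
  tv    = factor(rep(c("No", "Yes"), times = c(29, 21)), levels = c("No", "Yes"))
)

# Keep the rows without a TV, keep the sleep column, extract it as a vector,
# and pass that vector to the Shapiro-Wilk test.
no_tv_test <- sleep %>%
  filter(tv == "No") %>%
  select(sleep) %>%
  pull() %>%
  shapiro.test()

no_tv_test
```

Because the vector arrives through the pipe, the printed test reports its input as "data: .", as in the output in this section.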
Shapiro-Wilk normality test
data: .
W = 0.96206, p-value = 0.369
Now for those with a TV in their room:
Shapiro-Wilk normality test
data: .
W = 0.98242, p-value = 0.9557
In both cases, we fail to reject the null hypothesis. The data are consistent with sleep duration being normally distributed in each group, so we proceed under the normality assumption.
Now we will check the normality of sleep duration within each smartphone-use group.
Shapiro-Wilk normality test
data: .
W = 0.98243, p-value = 0.8662
Shapiro-Wilk normality test
data: .
W = 0.94375, p-value = 0.3355
Again, in both cases, we fail to reject the null hypothesis. The data are consistent with sleep duration being normally distributed in each group, so we proceed under the normality assumption.
Equality of Variances
The other test that we will perform is Bartlett’s test. The null hypothesis is that the variance of the continuous variable is equal in the two groups. The bartlett.test function performs this test.
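A minimal sketch of the call, using simulated stand-in data (group sizes, means and standard deviations taken from the descriptive tables), so the statistic will not match the results reported in this section. The two-argument form, values then grouping factor, is the form that prints a "data: ... and ..." line.

```r
set.seed(1)

# Simulated stand-in for the data set: 29 teenagers without a TV, 21 with one.
sleep <- data.frame(
  sleep = c(rnorm(29, mean = 8.32, sd = 1.16), rnorm(21, mean = 7.28, sd = 1.18)),
  tv    = factor(rep(c("No", "Yes"), times = c(29, 21)), levels = c("No", "Yes"))
)

# Bartlett's test of equal variances: numeric values first, grouping factor second.
bt <- bartlett.test(sleep$sleep, sleep$tv)
bt

# With two groups the test has one degree of freedom.
bt$parameter
```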
First, we will test TV in room:
Bartlett test of homogeneity of variances
data: sleep$sleep and sleep$tv
Bartlett's K-squared = 0.0058349, df = 1, p-value = 0.9391
Next, we will test smartphone use before bed:
Bartlett test of homogeneity of variances
data: sleep$sleep and sleep$smartphone
Bartlett's K-squared = 0.00064119, df = 1, p-value = 0.9798
In both cases, we fail to reject the null hypothesis of equal variances, so we can use an equal-variance t test. This test is performed using the t.test function with var.equal = TRUE.
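A minimal sketch of the equal-variance t test call, again on simulated stand-in data rather than the real data set, so the numbers will differ from the results in the next section. Setting var.equal = TRUE requests the pooled-variance test; the default (FALSE) would run Welch's test instead.

```r
set.seed(1)

# Simulated stand-in for the data set: 29 teenagers without a TV, 21 with one.
sleep <- data.frame(
  sleep = c(rnorm(29, mean = 8.32, sd = 1.16), rnorm(21, mean = 7.28, sd = 1.18)),
  tv    = factor(rep(c("No", "Yes"), times = c(29, 21)), levels = c("No", "Yes"))
)

# Pooled-variance (Student) two-sample t test of sleep by TV group.
tt <- t.test(sleep ~ tv, data = sleep, var.equal = TRUE)
tt

# Degrees of freedom are n1 + n2 - 2 = 29 + 21 - 2 = 48.
tt$parameter
```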
Statistical Analysis
Is there a difference in average sleep duration between those with and without a TV in their bedroom?
Two Sample t-test
data: sleep by tv
t = 3.1035, df = 48, p-value = 0.003203
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
0.3661298 1.7133448
sample estimates:
mean in group No mean in group Yes
8.320690 7.280952
We reject the null hypothesis that the mean number of hours of sleep per night is the same for those with a TV in their room and those without. The mean number of hours of sleep per night is significantly lower for those with a TV in their room (mean = 7.28) compared to those without a TV in their room (mean = 8.32) with p = 0.003.
We conclude that there is enough evidence in the data at the \(\alpha=0.05\) level of significance to state that there is a difference between the sleep duration of teenagers with and without a TV in their bedroom.
Is there a difference in average sleep duration between those who do and do not use a smartphone before bed?
Two Sample t-test
data: sleep by smartphone
t = 1.4354, df = 48, p-value = 0.1577
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-0.2126135 1.2737246
sample estimates:
mean in group No mean in group Yes
8.075000 7.544444
We fail to reject the null hypothesis that the mean number of hours of sleep per night is the same for those who use a smartphone before bed and those who do not. The mean number of hours of sleep per night is not significantly lower for those who use a smartphone before bed (mean = 7.54) compared to those who do not (mean = 8.08), with p = 0.158.
We conclude that there is not enough evidence in the data at the \(\alpha=0.05\) level of significance to state that there is a difference between the sleep duration of teenagers who do and don’t use a smartphone before bed.
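As a check on both results, the equal-variance t statistic can be recomputed from the group summaries reported earlier. The helper pooled_t below is a hypothetical name introduced for this sketch. Because the standard deviations are rounded to two decimals, the results agree with the t.test output only approximately (about 3.11 versus 3.1035 for TV, and about 1.44 versus 1.4354 for smartphone use).

```r
# Hypothetical helper: pooled-variance two-sample t test from summary statistics.
pooled_t <- function(m1, s1, n1, m2, s2, n2) {
  df  <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df  # pooled variance
  se  <- sqrt(sp2 * (1 / n1 + 1 / n2))             # standard error of the difference
  t   <- (m1 - m2) / se
  p   <- 2 * pt(-abs(t), df)                       # two-sided p-value
  list(t = t, df = df, p = p)
}

# TV in bedroom: group "No" (n = 29) versus group "Yes" (n = 21).
tv_res <- pooled_t(8.320690, 1.16, 29, 7.280952, 1.18, 21)

# Smartphone before bed: group "No" (n = 32) versus group "Yes" (n = 18).
phone_res <- pooled_t(8.075000, 1.25, 32, 7.544444, 1.26, 18)

# Side-by-side comparison of the two tests.
rbind(tv = unlist(tv_res), smartphone = unlist(phone_res))
```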